Introduction to Jupyter Notebooks and Pandas

1. New to Jupyter notebooks?

  • This is a Jupyter notebook file (file extension .ipynb, versus .py for standard Python scripts)
  • It is a format used by many data scientists and researchers for analysis and visualization tasks
  • Useful because (a) it is easier to work with and understand than a single Python script file (.py) and
  • (b) it lets you break problems down into steps and quickly see the results in the same window
  • Key point: for data analysis, a Jupyter notebook is a clear and efficient way to conduct analysis and communicate the findings

Two types of Jupyter notebook cells:

(1) Markdown cells (Cell->Cell Type->Markdown):

  • Descriptions
  • Text (this entire cell you are looking at is a Markdown cell)
  • Images
  • Key point: Markdown cells are a place for notes and descriptions



(2) Code cells (Cell->Cell Type->Code): Actual code

  • A code cell stores and executes your code
  • Any non-code text must be formatted as a comment, beginning with "#"
  • Key point: code cells are the workhorse; they execute your program and output results

Running Jupyter notebook cells

(1) To run every line of code in the entire notebook

  • Cell->Run All

(2) To run only a single code cell

  • Click on the cell and hit Shift-Enter
In [2]:
# This is a code cell
# This is the place where your code is written
# Any lines beginning with "#" are not code but descriptions to help you and others know what is being done
# Press Shift-Enter to run this cell
In [3]:
# This is another code cell
# Here we have some code written...it is a 'function'; let's not worry about the details of 'functions' at the moment
#  it is just for seeing outputs
# in Jupyter notebooks, code cells show their output below the cell that was run
def perseverance(text):
    return "so long as you do not stop!"


print(f"{perseverance('It does not matter how slowly you go...')}") 

# below this example code cell an output will appear
so long as you do not stop!
In [4]:
# another example

print("this will print a message to the output") 


# print("some text") messages help to a) show the result of the code and b) understand what went wrong
this will print a message to the output

2. New to pandas?

  • Pandas is a powerful data manipulation library in Python
  • It is widely used in data science, machine learning, scientific computing, and many other data-intensive fields
  • Useful because (a) it provides flexible data structures for efficient manipulation of structured data and
  • (b) it has rich functionality for data cleaning, transformation, and analysis
  • Key point: for data analysis, pandas provides a high-performance, easy-to-use data structure (DataFrame) and data analysis tools.

Two main data structures in pandas:

(1) Series :

  • A one-dimensional labeled array capable of holding any data type
  • It is similar to a column in a spreadsheet, a field in a database, or a vector in a mathematical matrix
  • Key point: Series is the primary building block of pandas



(2) DataFrame :

  • A two-dimensional labeled data structure with columns potentially of different types.
  • It is similar to a spreadsheet or SQL table
  • Key point: DataFrame is the primary pandas data structure for data manipulation and analysis
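
A minimal sketch of both structures (values invented for illustration):

```python
import pandas as pd

# A Series: a one-dimensional labeled array
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s["b"])  # access a value by its label -> 20

# A DataFrame: two-dimensional, columns may hold different types
df = pd.DataFrame({"st": ["TX", "OK"], "inj": [100, 50]})
print(df.shape)  # (rows, columns) -> (2, 2)
print(df["st"])  # each column of a DataFrame is itself a Series
```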

Working with pandas

(1) To import the pandas library

  • import pandas as pd

(2) To create a DataFrame

  • df = pd.DataFrame(data)

(3) To read a CSV file into a DataFrame

  • df = pd.read_csv('file.csv')

(4) To get the first 5 rows of the DataFrame

  • df.head()
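
The steps above can be sketched end to end; io.StringIO stands in for a CSV file on disk (the contents are invented for illustration):

```python
import io
import pandas as pd

# read_csv accepts a file path or any file-like object
csv_text = "yr,st,mag\n1950,OK,1\n1951,TX,2\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # -> (2, 3)
print(df.head())  # first rows (up to 5)
```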

3. Load your packages

In [5]:
# uncomment the last line of this cell if you want to see which packages you already have installed
#  The '!' character below runs the command as a shell command directly in the notebook
# conda is the recommended package manager
# ! conda list 
In [6]:
# I recommend using conda to install itables in your terminal 
# why in the terminal and not in a code cell? : because conda asks you to confirm the installation with yes/no,
#  which may or may not show up in the cell output
# In the terminal type: conda install itables
In [7]:
import pandas as pd # should already be pre-installed with conda

from itables import init_notebook_mode, show # to take full advantage of displaying our data in table format, let's use the itables library 

init_notebook_mode(all_interactive=True) # after this, any pandas DataFrame or Series is displayed as an interactive table, which lets you explore, filter, or sort your data
In [8]:
# the following code cell will give you nice hover highlighting on your tables when used with the itables library
In [9]:
%%html
<style>
  .dataTables_wrapper tbody tr:hover {
    background-color: #6495ED; /* Cornflower Blue */
  }
</style>


<!-- #1E3A8A (Dark Blue) -->
<!-- #0D9488 (Teal) -->
<!-- #065F46 (Dark Green) -->
<!-- #4C1D95 (Dark Purple) -->
<!-- #991B1B (Dark Red) -->
<!-- #374151 (Dark Gray) -->
<!-- #B45309 (Deep Orange) -->
<!-- #164E63 (Dark Cyan) -->
<!-- #4A2C2A (Dark Brown) -->
<!-- #831843 (Dark Magenta) -->

<!-- Suggested Light Colors for Light Backgrounds -->
<!-- #AED9E0 (Light Blue) -->
<!-- #A7F3D0 (Light Teal) -->
<!-- #D1FAE5 (Light Green) -->
<!-- #DDD6FE (Light Purple) -->
<!-- #FECACA (Light Red) -->
<!-- #E5E7EB (Light Gray) -->
<!-- #FFEDD5 (Light Orange) -->
<!-- #B2F5EA (Light Cyan) -->
<!-- #FED7AA (Light Brown) -->
<!-- #FBCFE8 (Light Magenta) -->

4. Read in the data

In [10]:
# here we will load a historical US tornado dataset using pandas

# you can give it any name you want...it just has to follow the Python naming convention, i.e., so-called snake_case
tor_data = pd.read_csv(r".\Data\1950-2022_actual_tornadoes.csv") 

print(tor_data.head()) 

# You can show the contents of this dataset as a table below by simply typing its name and running this code cell OR using print(data_set_name)
    om    yr  mo  dy        date      time  tz  st  stf  stn  ...   len  wid  \
0  192  1950  10   1   10/1/1950  21:00:00   3  OK   40   23  ...  15.8   10   
1  193  1950  10   9   10/9/1950   2:15:00   3  NC   37    9  ...   2.0  880   
2  195  1950  11  20  11/20/1950   2:20:00   3  KY   21    1  ...   0.1   10   
3  196  1950  11  20  11/20/1950   4:00:00   3  KY   21    2  ...   0.1   10   
4  197  1950  11  20  11/20/1950   7:30:00   3  MS   28   14  ...   2.0   37   

   ns  sn  sg   f1  f2  f3  f4  fc  
0   1   1   1   25   0   0   0   0  
1   1   1   1   47   0   0   0   0  
2   1   1   1  177   0   0   0   0  
3   1   1   1  209   0   0   0   0  
4   1   1   1  101   0   0   0   0  

[5 rows x 29 columns]
In [11]:
# now see how it looks as an interactive table. You can click on the arrows by the column names to sort them...
tor_data.head() # note that you do not have to use show() to generate an interactive table unless you want to add additional customizations 
Out[11]:
[interactive table output]
In [12]:
# To read any dataset into python you need to specify:

    # 1. WHAT - name to save the data to 
    # 2. WHO - is going to do the work? i.e., which package is going to do the work?, here we rely on pandas 'pd' package or library 
    # 3. HOW - read_csv(), the specific method that pandas will use 
    # 4. WHERE - is the file? the file is specified by its name with a relative path (i.e. '.\') or by its full path, including its extension; in this case .csv is the extension


# (1)     (2)    (3)             (4) actual file name and its path is always in quotes
tor_data = pd.read_csv(r".\Data\1950-2022_actual_tornadoes.csv") # Note: we use an r"path_name_here" raw string because 'r' allows us to a) copy-paste paths directly from your file explorer and, mainly, b) use '\' without causing an error in Python
In [13]:
# a good habit is to make a copy of the original data BEFORE making any changes

orginal_tor_data=tor_data.copy() # now when we want to compare or revert to the original data we have a way to do that
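
Why .copy() matters: a minimal sketch (toy data invented for illustration) showing the backup stays independent of later edits:

```python
import pandas as pd

df = pd.DataFrame({"mag": [1, 2, 3]})
backup = df.copy()           # an independent deep copy

df.loc[0, "mag"] = 99        # change the working data
print(df.loc[0, "mag"])      # -> 99
print(backup.loc[0, "mag"])  # -> 1, the backup is untouched
```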
In [14]:
# use the itables package with the .show() method
# notice how we can customize the table
show(tor_data.head(), caption="Tornado Data 1950-2022", options={'hover': True})  # itables also allows you to add a title or caption to your table


# caption adds a title to the table; it appears below the table but is still useful for descriptions
# options={'hover': True} adds the hover-highlighting function
In [ ]:
#  column_filters="header" extension for itables, my personal favorite for exploring table data 



show(tor_data, column_filters="header", layout={"topEnd": None}, maxBytes=0, options={'hover': True}) # adds individual column filters and removes the single default search bar (which isn't that useful)
[interactive table output]
In [ ]:
# This is similar to an Excel or Google Sheets table 
# we added column filters above each column, which let us visually explore, AND because we set maxBytes=0...
# we can visually search through the entire dataset!!
# note that setting maxBytes=0 consumes a lot of memory for large datasets, as you are loading the full dataset (so use maxBytes=0 wisely!)
In [ ]:
# add buttons to export the data to your table too!

show(tor_data[(tor_data['yr']==2011)], buttons=["copyHtml5", "csvHtml5", "excelHtml5"], options={'hover': True}) # to make full use of this, we first filter the data down to what we want; then we can search for a specific value
[interactive table output]
In [ ]:
show(
    tor_data,
    fixedColumns={"start": 1, "end": 2},
    scrollX=True, options={'hover': True}
)

# this is a neat way to lock columns you want to keep an eye on, and scan across the table
[interactive table output]
In [ ]:
# and for a conditional approach to exploring the data table visually
# use the searchBuilder extension of itables
# this lets you add multiple conditions to filter down the data

show(
    orginal_tor_data, options={'hover': True},
    layout={"top1": "searchBuilder"}, maxBytes=0,
    searchBuilder={
        "preDefined": {
            "criteria": [
                {"data": "inj", "condition": "=", "value": [""]}
            ]
        }
    },
)
[interactive table output]
In [ ]:
# Very cool!!
# this is a good way to sanity check and refine any code-based approach
# With this table setup, you can gain a good understanding about what the data is like
In [ ]:
# Now that we have a good handle on the data
# let us confirm the type of data we are using 

type(tor_data)

# a DataFrame is the technical name for the data table with rows and columns that pandas has created for us
Out[ ]:
pandas.core.frame.DataFrame
In [ ]:
# this is the table of data we read in 
# it has over 68 thousand rows and 29 columns
# the first column has no name, BUT it contains all the row labels of the data; these row labels are called the index
In [ ]:
# another way to know how many rows, columns

tor_data.shape
Out[ ]:
(68701, 29)
In [ ]:
# what if you want to know how many rows, columns, also the data type of each column?

tor_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68701 entries, 0 to 68700
Data columns (total 29 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   om      68701 non-null  int64  
 1   yr      68701 non-null  int64  
 2   mo      68701 non-null  int64  
 3   dy      68701 non-null  int64  
 4   date    68701 non-null  object 
 5   time    68701 non-null  object 
 6   tz      68701 non-null  int64  
 7   st      68701 non-null  object 
 8   stf     68701 non-null  int64  
 9   stn     68701 non-null  int64  
 10  mag     68701 non-null  int64  
 11  inj     68701 non-null  int64  
 12  fat     68701 non-null  int64  
 13  loss    68701 non-null  float64
 14  closs   68701 non-null  float64
 15  slat    68701 non-null  float64
 16  slon    68701 non-null  float64
 17  elat    68701 non-null  float64
 18  elon    68701 non-null  float64
 19  len     68701 non-null  float64
 20  wid     68701 non-null  int64  
 21  ns      68701 non-null  int64  
 22  sn      68701 non-null  int64  
 23  sg      68701 non-null  int64  
 24  f1      68701 non-null  int64  
 25  f2      68701 non-null  int64  
 26  f3      68701 non-null  int64  
 27  f4      68701 non-null  int64  
 28  fc      68701 non-null  int64  
dtypes: float64(7), int64(19), object(3)
memory usage: 15.2+ MB
In [ ]:
# Scanning through, we focus on the second-to-last line, which shows 3 data types: float64, int64, and object
# that tells you how many columns there are of each dtype: mostly int64, i.e., integers or whole numbers,
# then float64 (decimals), and finally object, which here means text (strings)

# Any missing data, i.e., null values?
# Not surprisingly, we learn that there are no empty values
    # how do we know? Because every column says '68701 non-null', which matches the number of rows in the entire DataFrame
    # there are some caveats to what we define as 'missing', but here we just mean that the value is blank

5. Check for missing data

In [ ]:
# you are the skeptical type, yes? Good...
# see for yourself if we have missing data

tor_data.isnull().sum() # for more on how this works, do it in two steps: run tor_data.isnull() alone and print it; .sum() then adds up the Trues (null=True) and Falses for each column
Out[ ]:
[interactive output: 0 missing values in every column]
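
To see what .isnull().sum() is doing, here is a toy frame with deliberately missing values (data invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"mag": [1.0, np.nan, 3.0], "st": ["TX", "OK", None]})

print(df.isnull())        # True wherever a value is missing
print(df.isnull().sum())  # missing values per column: mag -> 1, st -> 1
```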
In [ ]:
# why do we care about missing data?
    # it is a problem if we want to make fair comparisons
    # it complicates the meaning of the results

# Why care about data types?
    # pandas has special methods that only apply to certain types of data, e.g., dates and times
In [ ]:
# let's read in the data again; this time let's tell pandas to parse the 'date' column as actual dates

tor_data = pd.read_csv(r".\Data\1950-2022_actual_tornadoes.csv", parse_dates= ['date'])
In [ ]:
tor_data.info() 

# check out the date- columns data type now
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68701 entries, 0 to 68700
Data columns (total 29 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   om      68701 non-null  int64         
 1   yr      68701 non-null  int64         
 2   mo      68701 non-null  int64         
 3   dy      68701 non-null  int64         
 4   date    68701 non-null  datetime64[ns]
 5   time    68701 non-null  object        
 6   tz      68701 non-null  int64         
 7   st      68701 non-null  object        
 8   stf     68701 non-null  int64         
 9   stn     68701 non-null  int64         
 10  mag     68701 non-null  int64         
 11  inj     68701 non-null  int64         
 12  fat     68701 non-null  int64         
 13  loss    68701 non-null  float64       
 14  closs   68701 non-null  float64       
 15  slat    68701 non-null  float64       
 16  slon    68701 non-null  float64       
 17  elat    68701 non-null  float64       
 18  elon    68701 non-null  float64       
 19  len     68701 non-null  float64       
 20  wid     68701 non-null  int64         
 21  ns      68701 non-null  int64         
 22  sn      68701 non-null  int64         
 23  sg      68701 non-null  int64         
 24  f1      68701 non-null  int64         
 25  f2      68701 non-null  int64         
 26  f3      68701 non-null  int64         
 27  f4      68701 non-null  int64         
 28  fc      68701 non-null  int64         
dtypes: datetime64[ns](1), float64(7), int64(19), object(2)
memory usage: 15.2+ MB
In [ ]:
tor_data['date'].dtypes

#dtype('<M8[ns]') is equivalent to a datetime64 data type in pandas, which is used for date and time data
Out[ ]:
dtype('<M8[ns]')
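
To see what parse_dates changes, here is a minimal sketch using an in-memory CSV (the values are modeled on the tornado file but invented):

```python
import io
import pandas as pd

csv_text = "date,mag\n10/1/1950,1\n11/20/1950,2\n"

# without parse_dates, the 'date' column stays as plain text (object dtype)
raw = pd.read_csv(io.StringIO(csv_text))
print(raw["date"].dtype)     # -> object

# with parse_dates, pandas converts it to datetime64[ns]
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
print(parsed["date"].dtype)             # -> datetime64[ns]
print(parsed["date"].dt.month.tolist()) # datetime columns unlock the .dt accessor
```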

6. Process the data

In [ ]:
tor_data.set_index('date') # this shows us the result of the data when it is indexed by date
Out[ ]:
[interactive table output]
In [ ]:
# Setting the date as the index is helpful for a time-series type of analysis 
# BUT we have not actually changed the index to be the date yet
# the cell above only displays the result; tor_data itself has not been modified

# if we wanted to make a plot of tornadoes over time we would get an error, because we did not change the index of the data to date
# curious to see the error? try this to see what happens when plotting the data: tor_data.plot()
In [ ]:
# We need to set a special parameter to make sure we actually change the data 
# the inplace=True parameter ensures the index is set to date on tor_data itself

tor_data.set_index('date', inplace=True) 
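
A small sketch of the inplace behavior (toy data invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"date": ["2011-04-27", "2011-04-28"], "inj": [1500, 3]})

# without inplace, set_index returns a NEW frame and leaves df unchanged
indexed = df.set_index("date")
print("date" in df.columns)   # -> True, df still has the column

# with inplace=True, df itself is modified
df.set_index("date", inplace=True)
print(df.index.name)          # -> date
print("date" in df.columns)   # -> False
```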
In [ ]:
# want to see all the columns without using itables?
# Set display options

pd.set_option('display.max_columns', None) # None means show all the columns, but you can use a number to show only that number of columns

7. Summarize the data

In [ ]:
tor_data.head() # look at only the top of the data; you can pass an argument to .head() to show that many rows
Out[ ]:
[interactive table output]
In [ ]:
tor_data.head(20) # pass an argument to see more rows
Out[ ]:
[interactive table output]
In [ ]:
tor_data.tail() # look at only the last rows of the data
Out[ ]:
[interactive table output]
In [ ]:
# need a quick statistical summary of the data?

tor_data.describe()
Out[ ]:
[interactive table output]
In [ ]:
# need to see summary for specific columns?
# add double brackets and list your columns in quotes,
# dataframe_name[['column1', 'column2']]

tor_data[["mag", "len", "wid", "loss"]].describe()
Out[ ]:
[interactive table output]
In [ ]:
# so just using the .describe() method we can access a bunch of summary statistics: mean, min, max, etc.
# for example, now we can see that the average length a tornado travels, across the entire dataset, is 3.48 miles!
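
How .describe() behaves is easy to verify on a tiny frame (values invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"mag": [1, 2, 3, 4], "len": [0.5, 1.5, 2.5, 3.5]})

summary = df[["mag", "len"]].describe()  # rows: count, mean, std, min, quartiles, max
print(summary.loc["mean", "mag"])  # -> 2.5
print(summary.loc["max", "len"])   # -> 3.5
```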

8. Access specific subsets of data : Filtering

In [ ]:
# only interested in a certain year?
# filter the data by that year
tor_data[(tor_data['yr']==2011)]
Out[ ]:
[interactive table output]

9. Sorting the data

In [ ]:
#  Say we are interested in injuries in 2011, a year known for having the largest tornado outbreaks i.e., multiple tornado events on the same day
#  So, maybe we should sort by the highest injuries within that known year for tornado outbreaks? 

# filter the data by that year, and then sort by injuries 
tor_data[(tor_data['yr']==2011)].sort_values(by='inj', ascending=False)
Out[ ]:
[interactive table output]
In [ ]:
# Say within 2011, we are only interested in a certain subset of that year?
# Then filter the data even further. 
# Looking at 2011 tornado events AND those magnitude 2 and above
tor_data[(tor_data['yr']==2011) & (tor_data['mag']>=2)].sort_values(by='inj', ascending=False) # notice how we are chaining together the commands? Filter by year AND mag >=2, then sort injuries highest to lowest
Out[ ]:
[interactive table output]

10. Selecting specific rows with .loc[]

  • use it to select rows and columns by their labels
  • structure
    • DataFrame.loc[rows, columns]
    • rows can be a single label, a list of labels, or a boolean mask (a condition that the rows must meet)
    • columns can be a single label, a list, or a slice like df.loc[:, 'column1':'column3']
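
A small sketch of these .loc patterns (toy data invented for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {"st": ["TX", "OK", "TX"], "mag": [3, 2, 4], "inj": [10, 0, 25]},
    index=["a", "b", "c"],
)

print(df.loc["a", "mag"])                  # single row label, single column -> 3
print(df.loc[["a", "c"], ["st", "inj"]])   # lists of row and column labels
print(df.loc[df["mag"] >= 3, "st":"mag"])  # boolean rows plus an inclusive column slice
```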
In [ ]:
# if we are very interested in that specific date
tor_data.loc[(tor_data.index == '2011-04-27')] # note if the date was not the index we would use: df.loc[(df['date'] == '2022-01-01')]
Out[ ]:
[interactive table output]
In [ ]:
# Select only a specific row label 
# slice only the first few columns up to mag
tor_data.loc[tor_data.index == '2011-04-27',  :'mag'] # :'mag' means select all columns from the start up to and including 'mag'
Out[ ]:
[interactive table output]
In [ ]:
# specify the names for only those columns you need
tor_data.loc[tor_data.index == '2011-04-27',  ['st','mag','inj','fat', 'time']] # notice how we are using brackets around the column name list
Out[ ]:
[interactive table output]
In [ ]:
# if we are very interested in that specific date and the tornadoes occurring with magnitude 2 and up
tor_data.loc[(tor_data.index == '2011-04-27') & (tor_data['mag'] >= 2)] # note: if the date were not the index we would use: df.loc[(df['date'] == '2022-01-01') & (df['mag'] >= 2)]
[interactive table output]
In [ ]:
# up until now we were simply filtering the data down to a subset of interest
# and then viewing that subset
# usually what you want is to save that subset of interest into a variable 
# so you can (a) not have to repeat the filtering each time you want to see this subset, (b) focus on it separately from the rest of the data

# if we are very interested in that specific date and the tornadoes occurring with magnitude 2 and up
tor_data_2011_mag2 = tor_data.loc[(tor_data.index == '2011-04-27') & (tor_data['mag'] >= 2)] # note: if the date were not the index we would use: df.loc[(df['date'] == '2022-01-01') & (df['mag'] >= 2)]
In [ ]:
tor_data_2011_mag2.head().sort_values(by="inj", ascending=False) # What is wrong with this?
Out[ ]:
[interactive table output]
In [ ]:
# the order in which you do things matters...
# The outcome is correct if we sort first, then take a look at the data
tor_data_2011_mag2.sort_values(by="inj", ascending=False).head() # notice the first row is different from above?
Out[ ]:
[interactive table output]
In [ ]:
# 1,500 injuries were associated with one tornado event on April 27, 2011 in Alabama! 

11. Basic plotting in Pandas

In [ ]:
# Want to understand the data by year?
# So, group your data by year 
# How? using groupby

# 1. groupby is a method that groups rows by one column, so another column can be aggregated within each group
# 2. here we group by the year column "yr"
# 3. within those groups select the "om" column
# 4. count the non-null values in the "om" column (one per tornado)
# 5. call the plot method that is built into pandas

#            (1)  (2)   (3)    (4)     (5)
tor_data.groupby("yr")["om"].count().plot()
Out[ ]:
<Axes: xlabel='yr'>
In [ ]:
# what did we do here?
# we plotted the tornado counts per year over time using the built-in pandas plotting method .plot()
# it chooses a sensible default plot for the data
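
The groupby-count step can be checked without plotting; a toy sketch (values invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"yr": [1950, 1950, 1951], "om": [101, 102, 103]})

counts = df.groupby("yr")["om"].count()  # rows (tornadoes) per year
print(counts.loc[1950])  # -> 2
print(counts.loc[1951])  # -> 1
# counts.plot() would draw this same series as a line chart
```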
In [ ]:
# what we are actually plotting is a line plot

tor_data.groupby("yr")["om"].count().plot(kind='line')
Out[ ]:
<Axes: xlabel='yr'>
In [ ]:
# .plot() has different parameters, one is kind, you can set this to 'line' or something else... 
In [ ]:
# what if we had a special criterion?
# only show the data for tornadoes with magnitude greater than or equal to 3

tor_mag_3_data = tor_data[(tor_data["mag"]>=3)] 
In [ ]:
tor_mag_3_data.groupby("yr")["om"].count().plot(kind='line')
Out[ ]:
<Axes: xlabel='yr'>
In [ ]:
# plots the mag 3 and up data
In [ ]:
tor_mag_3_data.groupby(["st"])["om"].count().sort_values(ascending=False).plot(kind='bar')
Out[ ]:
<Axes: xlabel='st'>
In [ ]:
# bar plot for the magnitude 3 and up tornadoes grouped by state
# kind is the bar plot
# Options include 'line' (default), 'bar', 'barh', 'hist', 'box', 'kde', 'density', 'area', 'pie', 'scatter', 'hexbin'
# legend: Whether to show the legend. Defaults to True for line plots and False otherwise.
In [ ]:
# Always good to cross-check your plots another way
# here we start with the dataset for magnitude 3 and up
# then we filter tor_mag_3_data for the rows where the state is 'TX', i.e., Texas

tor_mag_3_data[tor_mag_3_data["st"]=="TX"].count()
Out[ ]:
[interactive table output]
In [ ]:
# bar plot results for mag 3 and up:
# Very clear that Texas has the most tornadoes among those with mag 3 and up
# Interesting...
#   But what about overall tornadoes? Is Texas still at the top?
In [ ]:
# add additional parameters to add labels, change figure size and color etc.
tor_data.groupby("st")["om"].count().sort_values(ascending=False).plot(
    kind='bar', title="Tornado Count by State: 1950-2022", rot=45, legend=True, figsize=(14,8), color='r', grid=False, xlabel="state", ylabel="Tornado counts", fontsize=14)

# Overall, Texas still has the most tornadoes
Out[ ]:
<Axes: title={'center': 'Tornado Count by State: 1950-2022'}, xlabel='state', ylabel='Tornado counts'>
In [ ]:
# A lot can be customized even using only this built-in pandas plotting function .plot()!
    # pandas uses matplotlib 'under the hood' to generate these plots
# But even more can be done by using the matplotlib library directly 
# Or with another library, Seaborn
#  both are specialized for creating very visually sophisticated plots
# ...until next time!
